An Introduction to Machine Learning
University of Zurich
Definition
The design of algorithms that learn from data and are capable of improvement as new experiences emerge. Specifically, the algorithm “is said to learn from experience \(E\) with respect to some task \(T\) and performance measure \(P\), if its performance at task \(T\), as measured by \(P\), improves with experience \(E\)” (Mitchell, 1997, p. 2).
The poles are:
The main uses of machine learning in (applied) social science are:
| Traditional | Machine Learning |
|---|---|
| Data | Set |
| Unit | Sample, example, or instance |
| Variable | Feature |
| Outcome | Label |
Model
An algorithm learns the model
\[ y_i = \phi \left(\boldsymbol{x}_i,\boldsymbol{\theta} \right) + \varepsilon_i \qquad(1)\]
\(i\) is the instance.
\(y\) is the label (a class or a score).
\(\phi(.)\) is an unknown function.
\(\boldsymbol{x}\) is the vector of predictive features.
\(\boldsymbol{\theta}\) are the unknown (tuning) parameters.
\(\varepsilon\) is irreducible error.
Prediction
\[ \psi_i = f \left( \boldsymbol{x}_i^*, \boldsymbol{q} \right) \qquad(2)\]
\(\psi\) is the predicted value.
\(f(.)\) is the learnt functional form.
\(\boldsymbol{x}^*\) is the subset of relevant features \(\Rightarrow\) feature selection.
\(\boldsymbol{q}\) contains the optimal parameter values.
In statistics, loss = error.
For at least one instance, \(\psi_i \neq y_i\).
The loss function quantifies the totality of error in the scalar \(L \left( \psi_i,y_i \right)\).
This quantity plays a crucial role in the learning process, which selects \(f(.)\) and \(\boldsymbol{q}\) in such a way that loss is minimized.
Later, we shall see how this is done.
Performance metrics measure how well the model accounts for the data and are given by \(P \left( \psi_i, y_i \right)\).
In some cases, they are derived directly from the loss function; in other cases, they follow their own logic (see Japkowicz & Shah, 2011).
Regardless, our goal is to achieve outstanding predictive performance (but see Thomas & Uminsky, 2020).
Training = the process of learning the functional form, parameters, and relevant features.
Evaluation = the process of assessing model performance.
Cardinal rule: Never train and evaluate on the same data!
Instead, we generate:
A training set for training purposes
A test set for evaluation purposes
If necessary, randomize the \(n\) instances.
Set aside a fraction of \(p\) instances for training \(\Rightarrow\) training set of size \(n_1 = p \cdot n\).
Set aside a fraction of \(1-p\) instances for evaluation \(\Rightarrow\) test set of size \(n_2 = n - n_1\).
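As a minimal base-R sketch of this recipe (the fraction \(p = 0.8\) and the iris data are stand-ins; in tidymodels the split is usually made with rsample instead):

```r
set.seed(2023)          # make the randomization reproducible
n <- nrow(iris)         # n instances
p <- 0.8                # fraction set aside for training

idx <- sample(n)        # randomize the n instances
n1  <- floor(p * n)     # training-set size n1 = p * n

train_set <- iris[idx[1:n1], ]          # training set
test_set  <- iris[idx[(n1 + 1):n], ]    # test set of size n2 = n - n1
```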
Although performance remains paramount, interpretation is becoming ever more important in machine learning.
In the words of Ribeiro et al. (2016, p. 1), “if the users do not trust a model or a prediction, they will not use it.”
Interpretation can be easier or more complex, depending on the nature of the algorithm.
tidymodels
(Figure source: RPubs)
The familiar regression model can be cast as a machine learning algorithm.
It is limited to regression tasks.
It is also somewhat atypical:
\(\phi(.)\) is imposed rather than learnt.
There are no tuning parameters.
There is no feature selection.
Because of its simplicity, however, it is a useful didactic starting point.
Specification
\[ y_i = \beta + \sum_{j=1}^P \omega_j x_{ij} + \varepsilon_i \qquad(3)\]
\(\beta\) is the bias (a.k.a. constant or intercept).
\(\omega\) is the weight (a.k.a. partial slope coefficient).
Note: Figures are based on the iris data, specifically the Versicolor subset. X1 = sepal width, X2 = petal width, and y = sepal length.
For regression tasks, several loss functions can be formulated.
Two common functions are:
Absolute loss: A.k.a. the \(L_1\)-norm, the goal is to minimize \(\sum_{i=1}^{n_1} \lvert \psi_i - y_i \rvert\).
Quadratic loss: A.k.a. the (squared) \(L_2\)-norm, the goal is to minimize \(\sum_{i=1}^{n_1} \left( \psi_i - y_i \right)^2\).
Quadratic loss is the foundation of ordinary least squares.
Absolute loss is used to minimize the impact of outliers.
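As a small numeric illustration (the labels and predictions below are made up), both losses can be computed directly:

```r
y   <- c(0.671, -0.682, 1.421, -0.896, -0.514)  # labels (illustrative)
psi <- c(0.505, -0.853, 1.356, -0.248, -0.760)  # predicted values (illustrative)

l1 <- sum(abs(psi - y))  # absolute (L1) loss
l2 <- sum((psi - y)^2)   # quadratic (L2) loss
```

Note how the single large residual (0.648, for the fourth instance) accounts for about half of the absolute loss but almost four fifths of the quadratic loss; squaring amplifies outliers.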
For regression tasks, three metrics are commonly computed over the test data:
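The list itself does not appear here; one common trio, assumed for illustration, is the root mean squared error (RMSE), the mean absolute error (MAE), and \(R^2\). A base-R sketch on a handful of predicted/observed pairs:

```r
y   <- c(5.33, 6.02, 6.86, 4.37, 5.63)  # observed labels
psi <- c(5.34, 6.00, 6.68, 3.45, 5.71)  # predicted values

rmse <- sqrt(mean((psi - y)^2))  # root mean squared error
mae  <- mean(abs(psi - y))       # mean absolute error
rsq  <- cor(psi, y)^2            # R-squared (squared correlation)
```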
A typical approach is to show a partial dependence plot.
Here, we distinguish between
One or two focal predictive features, \(\boldsymbol{x}_f\).
Controls, \(\boldsymbol{x}_c\).
We average the predicted values across the controls
\[ f(\boldsymbol{x}_f,\boldsymbol{q}) = \mathbb{E}_{\boldsymbol{x}_c} \left[ f(\boldsymbol{x}_c,\boldsymbol{x}_f,\boldsymbol{q}) \right] = \int f(\boldsymbol{x}_c,\boldsymbol{x}_f,\boldsymbol{q}) dp(\boldsymbol{x}_c) \]
2023 World Happiness Report (Helliwell et al., 2023).
Gallup World Poll data.
Label: “Please imagine a ladder with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?”
Log per capita GDP.
Social support from friends in troubled times.
Healthy life expectancy.
Freedom to make life choices.
Generosity, i.e., charitable donations.
Perceptions of corruption.
rsample
recipes
For a regression problem, often little pre-processing is needed.
The predictive feature per capita GDP is typically skewed and might require a transformation, but that was already done for us.
Hence, we skip this element for now and return to it shortly.
parsnip
We specify the algorithm, any tuning parameters, and the mode (i.e., the task).
We set the external learning engine (= implementation of the algorithm) using set_engine.
We declare the model formula and training set using fit.
Available algorithms and engines can be found via the parsnip model search.
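A minimal sketch of these three steps, using the built-in iris data as a stand-in (linear regression with the "lm" engine; the variable choice follows the earlier figures):

```r
library(parsnip)

# Specify the algorithm, tuning parameters (none here), and the mode
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")  # external learning engine

# Declare the model formula and the training set
lm_fit <- fit(lm_spec, Sepal.Length ~ Sepal.Width + Petal.Width, data = iris)
```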
# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.37 1.01 -2.35 0.0217
2 logpercapgdp 0.155 0.0874 1.77 0.0808
3 socialsupport 4.45 0.704 6.32 0.0000000185
4 healthylifeexpectancy 0.0326 0.0177 1.84 0.0691
5 freedom 1.91 0.627 3.04 0.00325
6 generosity 0.209 0.410 0.508 0.613
7 perceptionsofcorruption -1.07 0.439 -2.45 0.0169
yardstick
# A tibble: 56 × 8
.pred happiness logpercapgdp socialsupport healthylifeexpectancy freedom
* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5.34 5.33 9.30 0.855 66.5 0.571
2 6.00 6.02 9.96 0.891 67.2 0.823
3 6.68 6.86 10.8 0.915 70.9 0.825
4 3.45 4.37 8.10 0.437 56.1 0.743
5 5.71 5.63 9.62 0.880 67.3 0.746
6 4.09 4.64 7.67 0.663 55.5 0.696
7 4.18 4.97 8.22 0.686 55.8 0.686
8 7.01 6.96 10.8 0.929 71.4 0.874
9 6.07 6.33 10.1 0.889 70.3 0.792
10 5.65 5.63 9.58 0.822 69.3 0.804
# ℹ 46 more rows
# ℹ 2 more variables: generosity <dbl>, perceptionsofcorruption <dbl>
DALEX = Descriptive mAchine Learning EXplanations (Biecek, 2018)
library(DALEXtra)
happy_explainer <- explain_tidymodels(
model = happy_fit,
data = test_set,
y = test_set$happiness
)
Preparation of a new explainer is initiated
-> model label : model_fit ( default )
-> data : 56 rows 7 cols
-> target variable : 56 values
-> predict function : yhat.model_fit will be used ( default )
-> predicted values : No value for predict function target column. ( default )
-> model_info : package parsnip , ver. 1.1.0 , task regression ( default )
-> predicted values : numerical, min = 3.025659 , mean = 5.443398 , max = 7.56488
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = -1.065165 , mean = 0.120227 , max = 1.130251
A new explainer has been created!
pdp_gdp <- model_profile(
happy_explainer,
variables = "logpercapgdp",
N = NULL
)
as_tibble(pdp_gdp$agr_profiles) %>%
ggplot(aes(`_x_`, `_yhat_`)) +
geom_line(size = 1.2, alpha = 0.8, color = "#31688EFF") +
geom_rug(sides = "b", col = "#31688EFF") +
labs(
x = "Log Per Capita GDP",
y = "Predicted Happiness",
color = NULL
) +
theme_bw()
You just completed your first machine learning project.
However, we took a shortcut by working with a subset of the data, instead of the whole file.
Our code was not optimal, since we did not create a workflow.
country happiness stderr logpercapgdp socialsupport
1 Afghanistan 1.859 0.033 7.324 0.341
2 Albania 5.277 0.066 9.567 0.718
3 Algeria 5.329 0.062 9.300 0.855
4 Andorra NA NA NA NA
5 Angola NA NA NA NA
6 Antigua and Barbuda NA NA NA NA
healthylifeexpectancy freedom generosity perceptionsofcorruption continent
1 54.712 0.382 -0.081 0.847 3
2 69.150 0.794 -0.007 0.878 4
3 66.549 0.571 -0.117 0.717 1
4 NA NA NA NA 4
5 NA NA NA NA 1
6 NA NA NA NA 2
ISO3
1 AFG
2 ALB
3 DZA
4 AND
5 AGO
6 ATG
Compared to the earlier example,
There are missing data.
We have a column country identifying the country, which is not a predictive feature.
We have a column stderr, which is the standard error on the happiness scores and will be ignored.
We have a column continent, which looks numeric but is a factor.
We have a column ISO3 that is fully redundant with country.
Let us first:
Drop ISO3 and stderr.
Turn continent into a proper factor.
Drop missing values (since we have no information about them except for their location).
Next, we should create a recipe that
Creates dummies for continent, since that is what the regression needs.
Ensures that country will not be used as a predictor but remains available to flag instances.
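A hedged sketch of such a recipe; the tiny data frame is synthetic and only mirrors the column names of the happiness data:

```r
library(recipes)

# Synthetic stand-in for the training set (values are made up)
train_set <- data.frame(
  country      = c("Albania", "Algeria", "Armenia", "Austria"),
  happiness    = c(5.28, 5.33, 5.40, 7.10),
  logpercapgdp = c(9.57, 9.30, 9.62, 10.90),
  continent    = factor(c("Europe", "Africa", "Asia", "Europe"))
)

happy_recipe <- recipe(happiness ~ ., data = train_set) %>%
  update_role(country, new_role = "id") %>%  # country flags instances, not a predictor
  step_dummy(continent)                      # dummy-code continent for the regression
```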
recipes
How does recipes process data? The recipe call screens the data for types (numeric, factor, string, and Boolean), as well as roles (e.g., outcome or predictor).
workflows
Workflows are a very efficient way to build machine learning projects.
They recognize that modeling is not just about fitting a model, but also about various pre- and post-processing steps.
As such, workflows include elements from parsnip, recipes, and possibly other packages.
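A minimal sketch of such a bundle, again with iris as a stand-in for the happiness data:

```r
library(workflows)
library(parsnip)
library(recipes)

# Pre-processing: dummy-code the factor Species
iris_recipe <- recipe(Sepal.Length ~ Sepal.Width + Species, data = iris) %>%
  step_dummy(Species)

# Model: plain linear regression via the "lm" engine
lm_spec <- linear_reg() %>% set_engine("lm")

# The workflow bundles pre-processing and model; fit() runs both
iris_wflow <- workflow() %>%
  add_recipe(iris_recipe) %>%
  add_model(lm_spec)

iris_fit <- fit(iris_wflow, data = iris)
```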
# A tibble: 11 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.23 1.27 -1.76 0.0832
2 logpercapgdp 0.152 0.0944 1.61 0.111
3 socialsupport 3.50 0.724 4.83 0.00000766
4 healthylifeexpectancy 0.0263 0.0205 1.28 0.204
5 freedom 2.72 0.626 4.34 0.0000474
6 generosity 0.751 0.486 1.55 0.127
7 perceptionsofcorruption -0.579 0.449 -1.29 0.201
8 continent_Americas 0.334 0.240 1.40 0.167
9 continent_Asia -0.0611 0.205 -0.298 0.767
10 continent_Europe 0.280 0.250 1.12 0.267
11 continent_Oceania 0.163 0.556 0.293 0.771
Our training problem was simple:
Few training instances.
Few predictive features.
A simple loss function.
This means we can do everything using calculus.
Most machine learning problems are complex with lots of data and difficult loss functions.
We want optimizers that
Are simple in terms of the calculus involved.
Can be evaluated numerically.
Require only a small amount of data to be processed at a time.
These notes show how this can be achieved.
Consider two standardized variables, \(x\) and \(y\).
The model is \(\psi_i = w x_i\), with a single weight \(w\) and no bias.
| x | y |
|---|---|
| 0.543 | 0.671 |
| -0.917 | -0.682 |
| 1.458 | 1.421 |
| -0.267 | -0.896 |
| -0.817 | -0.514 |
We solve \(w = \operatorname{argmin}_w \, L_2\).
The calculus recipe is:
Obtain the gradient: \(\nabla = \frac{dL_2}{dw}\).
Set to 0 and solve for \(w\).
This yields \(w = 0.930\).
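Spelled out for the five instances above (quadratic loss):

\[
L_2 = \sum_{i=1}^{5} \left( w x_i - y_i \right)^2
\qquad\Rightarrow\qquad
\nabla = \frac{dL_2}{dw} = 2 \sum_{i=1}^{5} x_i \left( w x_i - y_i \right)
\]

Setting \(\nabla = 0\) and solving:

\[
w = \frac{\sum_{i} x_i y_i}{\sum_{i} x_i^2} \approx \frac{3.721}{4.000} \approx 0.930
\]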
We can bypass much of the math and summation and will show this in two steps:
The result is some form of gradient descent—the computational workhorse of machine learning.
Make an initial guess of \(\boldsymbol{\theta}\); we call this the starting value.
Update so that the \(t\)th estimate is
\[ \boldsymbol{\theta}^t = \boldsymbol{\theta}^{t-1} - \alpha \boldsymbol{\nabla}^{t-1} \] where \(\alpha\) is the learning rate.
Repeat (iterate) until convergence.
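Applied to the single-weight example above, the recipe is a few lines of R (the starting value, learning rate, and iteration count are arbitrary choices):

```r
x <- c(0.543, -0.917, 1.458, -0.267, -0.817)
y <- c(0.671, -0.682, 1.421, -0.896, -0.514)

w     <- 0     # starting value
alpha <- 0.05  # learning rate

for (t in 1:200) {                  # iterate until (approximate) convergence
  grad <- 2 * sum(x * (w * x - y))  # gradient of the quadratic loss
  w    <- w - alpha * grad          # update rule
}
w  # close to the analytic solution of 0.930
```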
Batch gradient descent bypasses the problem of analytically solving \(\boldsymbol{\nabla} = \boldsymbol{0}\).
However, we still have to sum over all of the data and, with big data, this can be slow.
What if we could sample a single instance and update on that basis \(\rightarrow\) stochastic gradient descent.
Recipe:
Set the starting values.
Randomly select an instance from the training set and update using \(\boldsymbol{\theta}^t = \boldsymbol{\theta}^{t-1} - \alpha \boldsymbol{\nabla}_i^{t-1}\).
Make many passes through the training data; each pass is called an epoch.
Repeat this process until convergence has been reached.
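The stochastic variant, sketched on the same toy data (settings again arbitrary; with a constant learning rate, \(w\) hovers near, rather than settles exactly at, the minimizer):

```r
set.seed(1)
x <- c(0.543, -0.917, 1.458, -0.267, -0.817)
y <- c(0.671, -0.682, 1.421, -0.896, -0.514)

w     <- 0     # starting value
alpha <- 0.05  # learning rate

for (epoch in 1:100) {                      # passes through the training data
  for (i in sample(length(x))) {            # one randomly drawn instance at a time
    grad_i <- 2 * x[i] * (w * x[i] - y[i])  # gradient at a single instance
    w      <- w - alpha * grad_i            # update on that basis
  }
}
w  # fluctuates around the batch solution of about 0.93
```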
Batch gradient descent uses all training instances in an iteration:
Advantage: Fast convergence.
Disadvantage: Requires sums.
Stochastic gradient descent uses one training instance at a time:
Advantage: No sums.
Disadvantage: Slower convergence.
A compromise, mini-batch gradient descent, is to update on small batches of data, trading off convergence speed against the cost of summation.